Language identification incorporating lexical information
Abstract
In this paper we explore the use of lexical information for language identification (LID). Our reference LID system uses language-dependent acoustic phone models and phone-based bigram language models. For each language, lexical information is introduced by augmenting the phone vocabulary with the N most frequent words in the training data. Combined phone and word bigram models are used to provide linguistic constraints during acoustic decoding. Experiments were carried out on a 4-language telephone speech corpus. Using lexical information achieves a relative error reduction of about 20% on spontaneous and read speech compared to the reference phone-based system. Identification rates of 92%, 96% and 99% are achieved for spontaneous, read and task-specific speech segments respectively, with prior speech detection.
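The vocabulary augmentation described in the abstract can be illustrated with a small text-only sketch. It is not the authors' system: the paper applies the combined phone and word bigram models during acoustic decoding, while this example only builds the mixed token stream and a bigram model over it. The `W:` prefix for word tokens, the add-one smoothing, the toy pronunciation lexicon, and all function names are assumptions made for illustration.

```python
import math
from collections import Counter


def hybrid_tokens(words, lexicon, kept_words):
    """Turn a word sequence into a mixed stream of word and phone tokens.

    Words in kept_words stay whole (marked with a hypothetical "W:" prefix);
    every other word is expanded into its phones from the lexicon.
    """
    tokens = []
    for word in words:
        if word in kept_words:
            tokens.append("W:" + word)
        else:
            tokens.extend(lexicon[word])
    return tokens


def train_hybrid_bigram(sentences, lexicon, n_words):
    """Estimate bigram counts over phones plus the n_words most frequent words."""
    word_counts = Counter(w for sentence in sentences for w in sentence)
    kept = {w for w, _ in word_counts.most_common(n_words)}
    contexts, bigrams = Counter(), Counter()
    for sentence in sentences:
        toks = ["<s>"] + hybrid_tokens(sentence, lexicon, kept) + ["</s>"]
        contexts.update(toks[:-1])           # counts of each left context
        bigrams.update(zip(toks, toks[1:]))  # counts of adjacent pairs
    vocab_size = len(set(contexts) | {"</s>"})
    return {"kept": kept, "contexts": contexts,
            "bigrams": bigrams, "vocab_size": vocab_size}


def log_prob(model, tokens):
    """Add-one smoothed bigram log-probability of a token stream (an assumption;
    the abstract does not say how the bigram models are smoothed)."""
    toks = ["<s>"] + list(tokens) + ["</s>"]
    v = model["vocab_size"]
    return sum(
        math.log((model["bigrams"][(a, b)] + 1) / (model["contexts"][a] + v))
        for a, b in zip(toks, toks[1:])
    )


def identify(models, tokens_by_language):
    """Pick the language whose model scores its own candidate stream highest."""
    return max(models, key=lambda lang: log_prob(models[lang], tokens_by_language[lang]))


if __name__ == "__main__":
    # Toy data, invented for this example only.
    lexicon = {"the": ["dh", "ah"], "cat": ["k", "ae", "t"], "dog": ["d", "ao", "g"]}
    sentences = [["the", "cat"], ["the", "dog"]]
    model = train_hybrid_bigram(sentences, lexicon, n_words=1)
    print(sorted(model["kept"]))
    print(hybrid_tokens(["the", "cat"], lexicon, model["kept"]))
```

In a full system one such model would be trained per language, and `identify` would compare the scores of the token streams produced by each language's decoder. Here the streams are simply passed in as a dictionary keyed by language.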
Related papers
MWU-Aware Part-of-Speech Tagging with a CRF Model and Lexical Resources
This paper describes a new part-of-speech tagger including multiword unit (MWU) identification. It is based on a Conditional Random Field model integrating language-independent features, as well as features computed from external lexical resources. It was implemented in a finite-state framework composed of a preliminary finite-state lexical analysis and a CRF decoding using weighted finite-state...
Lexical Paraphrasing for Document Retrieval and Node Identification
We investigate lexical paraphrasing in the context of two distinct applications: document retrieval and node identification. Document retrieval – the first step in question answering – retrieves documents that contain answers to user queries. Node identification – performed in the context of a Bayesian argumentation system – matches users’ Natural Language sentences to nodes in a Bayesian netwo...
A hierarchical language model incorporating class-dependent word models for OOV words recognition
A new language model is proposed to cope with the demands for recognizing out-of-vocabulary (OOV) words not registered in the lexicon. This language model is a class N-gram incorporating a set of word models that reflect the statistical characteristics of the phonotactics, which depend on the lexical classes. Utilization of class-dependency enhances recognition accuracy and enables identificati...
Towards Best Practice for Multiword Expressions in Computational Lexicons
The importance and role of multi-word expressions (MWE) in the description and processing of natural language has been long recognized. However, multi-word information has often been relegated to the marginal role of idiosyncratic lexical information. The need for MWE lexicons grows even more acute for multi-lingual applications, for which (sometimes complex) correspondences must be identified,...
Comparative Study of Degree of Bilingualism in Lexical Retrieval and Language Learning Strategies
This study compares lexical retrieval amongst monolinguals and intermediate bilinguals and advanced bilinguals. It also investigates the possible effects of their language learning strategies on their respective lexical retrieval advantage. The study used a mixed methods design and the groups consisted of 20 Persian near-monolinguals, 20 Persian-English intermediate level bilinguals, and 20 Per...